Selective downstream cache processing for data access

ABSTRACT

A first request is received to access a first set of data in a first cache. A likelihood that a second request to a second cache for the first set of data will be canceled is determined. Access to the first set of data is completed based on the determining the likelihood that the second request to the second cache for the first set of data will be canceled.

BACKGROUND

The present disclosure relates to computing systems that employ one or more caches. More particularly, the present disclosure relates to completing data requests based on selective downstream cache processing.

Cache memories in a computing system can improve processor, application, and/or computing system performance by storing data (e.g., a computer instruction, or an operand of a computer instruction) in a memory that has a lower access latency (time to read or write data) as compared to other memories, such as a main memory (e.g., primary RAM) or a non-volatile storage device (e.g., a disk). Cache memory can be included in a processor, and/or between a processor and another memory (e.g., another cache memory and/or a main memory), and can store a copy of data otherwise stored in a main memory. For example, processors can include a local, or “Level 1” (L1), cache, and computing systems can include additional caches, such as “level 2” (L2) and “level 3” (L3) caches, between a processor (or, a local cache of a processor) and another memory (e.g., a main memory).

SUMMARY

Various embodiments are directed to a computer-implemented method, a system, and a computer program product. In some embodiments, the computer-implemented method includes receiving a first request to access a first set of data in a first cache. A likelihood may be determined that a second request to a second cache for the first set of data will be canceled. Access to the first set of data may be completed based on the determining the likelihood that the second request to the second cache for the first set of data will be canceled.

In some embodiments, a system comprises a computing device that includes a processor and at least a first cache and a second cache. The computing device further includes a set predictor configured to predict whether there will be a cache hit or cache miss within the first cache for a first set of data of a first request. The computing device further includes a request buffer configured to at least delay a second request to the second cache for the first set of data when the set predictor predicts the cache hit, wherein the buffer does not delay the second request when the set predictor predicts the cache miss. The computing device further includes a directory configured to at least indicate an actual cache hit or miss.

In some embodiments, a computer program product comprises a computer readable storage medium having program instructions embodied therewith. The program instructions are readable or executable by a processor to perform a method. The method comprises receiving a first request to access a first set of data in a first memory. The method further comprises predicting whether there will likely be a hit or a miss in the first memory. The method also comprises initiating, in parallel with the predicting whether there will likely be the hit or the miss in the first memory, a determination of whether there is an actual hit or actual miss in the first memory. Moreover, the method comprises generating, based on the predicting, a first action to facilitate access of the first set of data, the generating of the first action occurring before completion of the determination of whether there is an actual hit or actual miss in the first memory.

The above summary is not intended to describe each illustrated embodiment or every implementation of the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings included in the present application are incorporated into, and form part of, the specification. They illustrate embodiments of the present disclosure and, along with the description, serve to explain the principles of the disclosure. The drawings are only illustrative of certain embodiments and do not limit the disclosure.

FIG. 1 is a block diagram of an example processing environment, according to embodiments.

FIG. 2 is a block diagram of an example processing environment, according to embodiments.

FIG. 3 is a block diagram of an example multi-level cache processing environment, according to embodiments.

FIG. 4 is a flow diagram of an example process for selectively initiating downstream cache processing, according to embodiments.

FIG. 5 is a flow diagram of an example process for selectively initiating downstream cache processing, according to embodiments.

FIG. 6 is a flow diagram of an example process for selectively initiating downstream memory processing, according to embodiments.

FIG. 7 is a block diagram of an example computing system, according to embodiments.

FIG. 8 is a block diagram of a computing system, according to embodiments.

While the invention is amenable to various modifications and alternative forms, specifics thereof have been shown by way of example in the drawings and will be described in detail. It should be understood, however, that the intention is not to limit the invention to the particular embodiments described. On the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

DETAILED DESCRIPTION

Aspects of the present disclosure relate to selective downstream cache processing for data access. While the present disclosure is not necessarily limited to such applications, various aspects of the disclosure may be appreciated through a discussion of various examples using this context.

A processor can determine if a copy of a cache line is included in a local cache, such as when the processor executes an instruction that references a memory location within a particular cache line. As used herein, “cache line” refers interchangeably to a location in a memory, and/or a cache, corresponding to a cache line of data, and to data stored within that cache line, as will be clear from the context of the reference. If the cache line is stored (“cached”) within a local cache, the processor can use data from within the cached copy of the cache line. When a particular set of data or cache line is stored in a particular cache, this is known as a cache “hit”. If there is no cached copy of the data or cache line in a particular cache, the processor can incur a “cache miss”. A cache “miss” in one level of memory (e.g., L1) can trigger a fetch request to another level of memory (e.g., L2). Accordingly, in response to the cache miss, the processor can fetch the cache line from the corresponding memory location, from another cache, and/or from another processor having a valid (e.g., an unmodified or, alternatively, most recently modified) copy of the cache line in a local cache.
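
For illustration only, the following is a minimal sketch (not part of the disclosed embodiments) of the hit/miss flow described above: a lookup at one cache level either hits locally or triggers a fetch to the next level of memory. The class and variable names are illustrative assumptions.

```python
# A minimal sketch of hit/miss handling: a miss at one level triggers a
# fetch request to the next level (e.g., L1 -> L2 -> main memory).

class CacheLevel:
    def __init__(self, name, next_level):
        self.name = name
        self.lines = {}              # address -> cached data
        self.next_level = next_level

    def access(self, address):
        if address in self.lines:    # cache "hit"
            return self.lines[address]
        # cache "miss": fetch the line from the next level of memory
        data = self.next_level.access(address)
        self.lines[address] = data   # install the fetched copy locally
        return data

class MainMemory:
    def access(self, address):
        return f"data@{address:#x}"

l2 = CacheLevel("L2", next_level=MainMemory())
l1 = CacheLevel("L1", next_level=l2)
print(l1.access(0x80))   # misses in L1 and L2, fetched from memory
print(l1.access(0x80))   # hits in L1
```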

Typical cache hit/miss processes may cause significant overhead. For example, final L2 cache hit/miss determinations (e.g., as looked up in a cache directory) may cause poor access latency performance such that it may take a long time to process a fetch request. Further, a fetch request to a higher level of cache after a lower level miss also takes a significant amount of time, particularly if it occurs only after determining the actual hit/miss result. Moreover, some systems are prone to cancel many fetch requests, which also take a significant quantity of time to process.

FIG. 1 is a block diagram of an example processing environment 100, according to embodiments. Core 110 comprises instruction pipeline 114 and processing threads 116-1-116-4 (collectively, “threads 116”). In embodiments, threads 116 can, for example, each record an execution context (e.g., various states and/or attributes) of a particular sequence of instructions executed by core 110.

In embodiments, a processor core can be a component of a processor chip, and the chip can include multiple cores of the same or different type. Embodiments can include one or more processor chips in a processor module. As used herein, in addition to a “processor” including a local cache, “processor” further refers interchangeably to any of a thread, a core, a chip, a module, and/or any other configuration or combination thereof.

In embodiments, an instruction pipeline, such as pipeline 114, can enable a processor, such as 110, to execute multiple instructions, each in various stages of execution, concurrently. To illustrate, pipeline 114 can be an instance of an instruction pipeline such as example pipeline 150. Pipeline 150 comprises a plurality of instruction processing stages for a processor to execute multiple instructions, or portions of a single instruction, concurrently. FIG. 1 depicts pipeline 150 as comprising “fetch” stage 160, comprising fetch units F1-F4; “decode” stage 162, comprising decode units D1-D4; “issue” stage 164, comprising issue units I1-I4; execution stage exec/L1 166, comprising execution units E1-E4; and instruction completion stage complete/reject 168, comprising completion units C1-C4.

While example pipeline 150 is shown comprising five stages, each having four units, this is not intended to limit embodiments. Embodiments can include additional (or, fewer) stages, and/or stages within an execution pipeline can contain additional (or, fewer) units in each stage as compared to the example of FIG. 1 pipeline 150. “Deep” pipelines are examples of processor pipelines that can have more pipeline stages, and/or more units per stage, than shown in the example of FIG. 1. Likewise, while example cache 120 illustrates memory 122 comprising four cache line entries, this is not intended to limit embodiments.

In embodiments, instructions under execution by a core can proceed sequentially through an instruction pipeline, such as 150. Fetch stage 160 can fetch multiple instructions for execution using fetch units F1-F4. For example, instructions fetched by fetch stage 160 can proceed to decode stage 162, for concurrent decode using decode units D1-D4. Decoded instructions can be issued for execution via issue units I1-I4 of issue stage 164. Issued instructions can proceed to execution stage 166, and execution units E1-E4 can perform particular execution actions of those issued instructions, such as performing Arithmetic Logic Unit (ALU) or other computation unit operations, and/or loading or storing memory operands of the instructions. Completion units C1-C4 of complete/reject stage 168 can complete, and/or flush or terminate/cancel, instructions from other stages of pipeline 150. In embodiments, a pipelined processor can process a plurality of instructions, or portions of instructions, concurrently by means of the stages and units of the stages comprising an instruction pipeline.
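
For illustration only, the following is a minimal sketch, under stated assumptions, of in-order flow through the five stage names above; the shift-register model is an illustrative simplification, not the disclosed pipeline 150.

```python
# A minimal sketch of in-order pipeline flow: each cycle, every instruction
# advances one stage, and a new instruction (if any) enters the fetch stage.

STAGES = ["fetch", "decode", "issue", "execute", "complete"]

def run_pipeline(instructions):
    stages = [None] * len(STAGES)   # one instruction per stage per cycle
    pending = list(instructions)
    cycle = 0
    while pending or any(stages):
        # advance: each instruction shifts toward completion; the one
        # leaving the "complete" stage retires
        stages = [pending.pop(0) if pending else None] + stages[:-1]
        cycle += 1
        occupied = {STAGES[i]: ins for i, ins in enumerate(stages) if ins}
        print(f"cycle {cycle}: {occupied}")

run_pipeline(["load r1", "add r2", "store r3"])
```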

Embodiments can utilize non-pipelined processors (e.g., multi-cycle processors), and these processors can include a local cache. If an operand is not cached in a local cache, the processor can initiate cache miss processing. In such non-pipelined embodiments, cache miss processing can further include stopping or delaying execution of instructions using those operands, and/or instructions that may depend on the results of instructions using those operands.

Alternative embodiments can utilize pipelined processors, such as illustrated in FIG. 1, and a local cache can be a component of a unit within the pipeline, such as a load/store unit of an instruction pipeline. For example, in FIG. 1, local cache L1 is shown as a component of execution unit (or, stage) E1 in execution pipe exec/L1 166. While not shown, embodiments can include multiple execution and/or other units of an instruction pipeline that can each include local (e.g., L1) caches. It is recognized that the exec/L1 166 stage does not necessarily need to include an “L1” level of cache. For example, the exec 166 stage can include different levels (e.g., L2) or no levels at all.

In embodiments, L1 cache can be an instance of a cache such as illustrated by example cache 120. Cache 120 comprises a request interface module 126 and memory 122. Memory 122 includes cache lines 124-1-124-4 (collectively, “lines 124”), which can, in embodiments, store copies of cache lines in use by core 110. In some embodiments, the request interface module 126 performs one or more operations for cache hit/miss control, as explained in more detail below. In some embodiments, the request interface module 126 performs one or more operations associated with the execution stage 166 and/or the complete/reject stage 168.

The cache 120 also includes directory 128. In embodiments, the directory 128 records the identities (e.g., a memory address, or a subset or hash thereof) of cache lines stored in the cache 120. The cache directory 128 can include other information about cache lines 124, such as the most (or, alternatively, least) recent time each was referenced, or a number of times each has been referenced. The directory 128 can include a status associated with each of the cache lines 124 stored in the cache 120. Such status can include, for example, whether the cache line has shared vs. exclusive status, whether the cache line is valid (e.g., contains an unmodified, or most recently modified, copy), which processor holds the cache line (e.g., which core within a processor chip if, for example, a local cache is shared by multiple cores), and other attributes of the cache line and/or its usage in the cache.
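
For illustration only, the following is a minimal sketch of the kind of per-line record a directory such as directory 128 might keep. The field names and the 128-byte line size are illustrative assumptions, not the disclosed layout.

```python
# A minimal sketch of a cache directory entry and an actual hit/miss lookup.

from dataclasses import dataclass

@dataclass
class DirectoryEntry:
    tag: int                  # memory address (or subset/hash thereof)
    valid: bool = True        # holds an unmodified/most-recent copy
    shared: bool = True       # shared vs. exclusive status
    owner: int | None = None  # owning core id, if the cache is shared
    last_ref: int = 0         # most recent reference time (e.g., cycle)
    ref_count: int = 0        # number of times the line was referenced

directory: dict[int, DirectoryEntry] = {}

def lookup(address: int, cycle: int) -> bool:
    """Return True on an actual hit and update usage bookkeeping."""
    entry = directory.get(address >> 7)   # 128-byte lines: drop 7 offset bits
    if entry is None or not entry.valid:
        return False                      # actual miss
    entry.last_ref, entry.ref_count = cycle, entry.ref_count + 1
    return True                           # actual hit

directory[0x1000 >> 7] = DirectoryEntry(tag=0x1000 >> 7)
print(lookup(0x1000, cycle=1), lookup(0x2000, cycle=2))  # True False
```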

Execution units (e.g., E1) and/or other components (e.g., the request interface module 126) can determine, when using data in a cache line, whether the operands are stored in a local cache, such as L1 in E1. If it is determined that an operand is not cached in L1, the execution unit(s), the request interface module 126, and/or other components of core 110, can initiate cache miss processing. In embodiments, cache miss processing can further include stopping or delaying execution of instructions (or, portions of instructions) using those operands, and/or instructions that may depend on the results of instructions using those operands.

In some embodiments, a processor core, such as 110, can execute an instruction, or portions of an instruction, out of order and/or speculatively. Out of order execution can allow a processor to execute portions of an instruction or program as soon as an execution unit (e.g., a stage in a pipeline) is available, rather than delay execution to wait for completion of other portions of an instruction, or other instructions in a program. In this way, a processor can keep most or all of its execution units busy to improve computing throughput.

Speculative execution can allow a processor to execute an instruction, or a portion of an instruction, based on a likelihood that the processor will execute that instruction (or, portion thereof). For example, a processor can speculatively execute one or more instructions that follow a particular branch path in a program, prior to executing a conditional test that determines that path, based on a likelihood that the program will take that branch. In this way, a processor can utilize otherwise idle elements (e.g., stages of a pipeline) and can achieve higher computational throughput, in the event that the results of the speculatively-executed instruction (or portion thereof) can be used as other instructions (or, portions of an instruction) complete execution.

FIG. 2 is a block diagram of an example processing environment 200, according to embodiments. In some embodiments, the processing environment 200 is or is included in the request interface module 126 of FIG. 1. In some embodiments, the processing environment 200 is associated with or part of the pipeline 150 of FIG. 1. The request generation logic 201 sends out a request 203 (e.g., a fetch request), along with metadata of the request 203 (i.e., request_info 202). For example, the request_info 202 can include the address of the data to be fetched. The request_info 202 is transmitted to the request resource allocation/request handling logic 215, while the request 203 is intercepted by the cancel probability generator logic 207. The cancel probability generator 207 determines whether a request to another level of memory or cache for the data is likely to be canceled. In some embodiments, the cancel probability generator 207 includes a set predictor module that determines the likelihood of a hit or miss in the current level of cache being analyzed. Accordingly, if there is a high likelihood that the current cache level has the needed data (a hit), then there is a high likelihood that the request_cancel 205 (to cancel the request 203) will be sent, because the data was already located in the current level of cache analyzed and another request for the data in another level of cache/memory is not warranted. Alternatively, if there is a low likelihood that the current cache level includes the needed data (e.g., a miss), then there is a low likelihood that the request_cancel 205 will occur, because the data will need to be fetched from another memory or cache.

If the cancel probability generator 207 determines that there is a low probability that the request_cancel 205 will be sent (e.g., because there is a predicted miss in the current level of cache being analyzed), then another level of cache may automatically be queried (i.e., via the request_valid 209). The request_valid 209 fetch request may be automatically issued to the request resource allocation/request handling 215 regardless of whether the actual hit or miss result is known (e.g., via a directory lookup). This automated process may occur because fetch requests may take a relatively long amount of time to process, and when combined with the amount of time that it takes to determine the actual hit or miss information, waiting would delay the process even more. Accordingly, the request_valid 209 fetch request may be issued to the next level of cache or memory.

If the cancel probability generator 207 determines that there is a high likelihood that the request_cancel 205 will be sent (e.g., because there was a predicted hit in the current level of cache being analyzed), then the request to the request resource allocation 215 (i.e., the request_valid_late 211) may be delayed 213 (e.g., buffered, temporarily terminated, or discontinued, etc.). The request for the needed data within another level of cache may be delayed because of the high likelihood that a processor will locate the needed data in the current level of cache being analyzed. Accordingly, delaying 213 the request keeps the processing environment 200 from utilizing downstream resources (e.g., other levels of cache) if there is a high likelihood that they are not needed.

The request_cancel 205 request is generated by the request generation logic 201. The request_cancel 205 request is also delayed 213 so it can cancel the request_valid_late 211 request in time. For example, if the request_cancel 205 request were not buffered, and there was an unexpected actual hit/miss result, an inaccurate cancel message may have already been transmitted to another cache level. Conversely, the delay 213 allows for an actual hit/miss lookup such that, even if the result was unexpected, the delayed request can be aborted in the buffer instead of re-generating a request_cancel 205 and/or reversing downstream actions already communicated to the request resource allocation 215. In embodiments, the request_cancel 205 is transmitted straight to the request resource allocation 215 so that any non-delayed requests (e.g., request_valid 209), which were issued because cancellation was predicted to be unlikely, can still be canceled.
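
For illustration only, the following is a minimal sketch of the FIG. 2 flow just described: a predicted miss issues request_valid immediately, while a predicted hit holds request_valid_late in a delay buffer where a later request_cancel can still quash it. The threshold value and helper names are illustrative assumptions.

```python
# A minimal sketch of selective issue vs. delay based on cancel probability.

CANCEL_THRESHOLD = 0.5

def handle_request(request_id, cancel_probability, delay_buffer, downstream):
    if cancel_probability < CANCEL_THRESHOLD:
        downstream.append(request_id)         # request_valid: send immediately
    else:
        delay_buffer[request_id] = "pending"  # request_valid_late: hold it

def on_directory_result(request_id, actual_hit, delay_buffer, downstream):
    if actual_hit:
        # request_cancel: clear the held request; nothing went downstream
        delay_buffer.pop(request_id, None)
    elif request_id in delay_buffer:
        # unexpected miss: release the buffered request downstream
        del delay_buffer[request_id]
        downstream.append(request_id)

buffer, downstream = {}, []
handle_request("req-1", 0.9, buffer, downstream)         # predicted hit: buffered
on_directory_result("req-1", True, buffer, downstream)   # actual hit: cleared
print(buffer, downstream)   # {} [] -- no downstream resources were used
```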

FIG. 3 is a block diagram of an example multi-level cache processing environment 300, according to embodiments. In some embodiments, the environment 300 is a more detailed schema of the environment 200 of FIG. 2. In some embodiments, the environment 300 is implemented by the request interface module 126 of FIG. 1 and/or the pipeline 150 of FIG. 1. The environment 300 illustrates three levels of cache. However, it is to be understood that the levels of cache are representative only, in that there may be fewer or more levels of cache than represented in FIG. 3. In some embodiments, each level of cache represents another or different level of memory (e.g., main memory, non-volatile storage, etc.).

At operation 301 an L2 cache request is generated (e.g., by the request generation logic 201 of FIG. 2). In an illustrative example, the L2 request 301 may be a fetch request generated at L1 for a cache line in response to a “miss” result at L1. At a first time, the request 301 for data (e.g., a cache line) is sent to the set predictor at 307. The set predictor 307 predicts where the requested data can be found in L2. The set predictor 307 also predicts whether there will be a cache hit or miss at L2, as described in more detail below. The actual hit/miss is determined by the directory lookup at 314. At the first time, or substantially close to the first time, or in parallel with the prediction of the set predictor 307, the request 301 may be transmitted to the directory lookup 314. The directory lookup 314 may be associated with increased fetch latency. Typically, a high quantity of delay is associated with using the directory lookup 314, whereas the set predictor 307 is associated with decreased latency. Accordingly, the prediction of the set predictor 307 may be initiated at the same time, or close to the same time, as initiation of the directory lookup 314. In some embodiments, the set predictor 307 is identical to the cancel probability generator 207 of FIG. 2.
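
For illustration only, the following minimal sketch shows the timing relationship described above: the set predictor and the directory lookup start together, but the predictor's fast (possibly wrong) answer is available to act on well before the directory's slow, authoritative answer. The latencies are illustrative assumptions, not measured values.

```python
# A minimal sketch of the predictor/directory race.

PREDICTOR_LATENCY = 1   # cycles (fast, possibly wrong)
DIRECTORY_LATENCY = 4   # cycles (slow, authoritative)

def lookup_race(predicted_hit, actual_hit):
    events = [
        (PREDICTOR_LATENCY,
         f"set predictor: {'hit' if predicted_hit else 'miss'} (act on this now)"),
        (DIRECTORY_LATENCY,
         f"directory:     {'hit' if actual_hit else 'miss'} (confirm or correct)"),
    ]
    for cycle, event in sorted(events):
        print(f"cycle {cycle}: {event}")

lookup_race(predicted_hit=True, actual_hit=False)   # a misprediction corrected late
```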

In some embodiments, if the set predictor 307 predicts that there will be a miss 309, then a “request_valid” request 315 (e.g., another fetch request for the same data) may be intercepted by the request arbiter 316. If the set predictor 307 predicts that there will be a hit 311, then a “request_valid_late” request 318 may be transferred to the request buffer 313.

The request buffer 313 is used to initially delay or pause the request_valid_late request 318 (e.g., another request to L3 cache for the same data) from going to arbitration by the request arbiter 316. The request buffer 313 (or the delay 213 of FIG. 2) may be implemented in order to wait for the directory lookup 314 to be processed, as the directory lookup 314 will likely take longer than the set predictor 307 to complete processing. Moreover, there is no need to initiate downstream processing to L3 cache 322 (e.g., transmit the L3 request_valid request) if it is likely that there will be a hit in L2 cache. Once the directory lookup 314 confirms that there is an L2 hit, the request_valid_late request 318 may be cleared from the request buffer 313 by the request_cancel 305 operation, so the request will not be processed downstream. The request_cancel 305 may or may not be communicated to L3 cache 322, but it should be ignored by the L3 cache because the request has already been cleared in the L2 cache. However, in the situation that the predicted hit was incorrect (i.e., there was a miss instead of a hit), the request_valid_late request 318 that has been buffered 313 may then be sent to the request arbiter 316. The request_valid_late request 318 may then be transmitted as the L3 request_valid request to L3 cache.

The request arbiter 316 is able to hold multiple pending fetch requests and determines whether to take or choose a “request_valid” request 315 or a buffered request from the request buffer 313. Regardless of whether the buffered request or the “request_valid” request 315 is selected, the request arbiter 316 translates the request into an “L3 request_valid” request to L3 cache 322 (another fetch request for the same data). In some embodiments, such as in a more general L2 cache design, multiple requests from previous L1 cache levels (e.g., the request_valid request and/or the request_valid_late request) can be pending in the request arbiter 316. This may increase the number of pending fetch requests being held in the request arbiter 316 waiting for arbitration to the next cache level.
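
For illustration only, the following is a minimal sketch of an arbiter like request arbiter 316: it holds pending requests from both the immediate (request_valid) path and the released buffered (request_valid_late) path, and drains one per cycle toward L3. The priority rule (immediate path first) is an illustrative assumption; the disclosure does not specify a selection policy.

```python
# A minimal sketch of arbitration between the two request paths.

from collections import deque

class RequestArbiter:
    def __init__(self):
        self.immediate = deque()  # request_valid requests (predicted miss)
        self.released = deque()   # request_valid_late requests freed from the buffer

    def submit_valid(self, req):
        self.immediate.append(req)

    def submit_released(self, req):
        self.released.append(req)

    def arbitrate(self):
        """Select one pending request and translate it to an L3 request_valid."""
        for queue in (self.immediate, self.released):
            if queue:
                return {"L3_request_valid": queue.popleft()}
        return None   # nothing pending this cycle

arbiter = RequestArbiter()
arbiter.submit_valid("req-A")      # predicted miss, sent straight to arbitration
arbiter.submit_released("req-B")   # predicted hit that turned out to be a miss
print(arbiter.arbitrate(), arbiter.arbitrate(), arbiter.arbitrate())
```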

In some situations it might be beneficial to override the result of the set predictor 307. In some embodiments, for example, the force module/logic 312 can force the prediction result to always indicate a cache hit (regardless of the set predictor results) when it is determined that a cache line is to be promoted from shared status to exclusive status. This means the cache line is already in the cache (cache hit) and does not need to be fetched from the next cache level. The term “exclusive” refers to a processor (or core) that has exclusive rights, or “exclusivity”, to a particular cache line (i.e., the processor does not share access rights to the cache line with any other processor). In embodiments, a processor having exclusivity to a cache line can change the status of the cache line from “shared” to “exclusive”. In some embodiments, while a cache line has exclusive status, a controlling processor can modify, in a local cache, data within that cache line. In some embodiments, the force logic 312 can alternatively force a cache miss prediction (regardless of the set predictor results). For example, if the L2 cache reaches a task threshold or is otherwise busy, it may be desirable to force a cache miss at L2 in order to fetch the same data from the L3 cache, which may not be associated with as many tasks.
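
For illustration only, the following is a minimal sketch of override logic like the force module 312: the raw set predictor output can be forced to “hit” (e.g., for a shared-to-exclusive promotion, where the line is already present) or forced to “miss” (e.g., when the cache is past a busyness threshold). The threshold value is an illustrative assumption.

```python
# A minimal sketch of forcing the prediction result past the set predictor.

BUSY_TASK_THRESHOLD = 32

def final_prediction(predicted_hit, promoting_to_exclusive, pending_tasks):
    if promoting_to_exclusive:
        return True    # force hit: the line is already cached locally
    if pending_tasks >= BUSY_TASK_THRESHOLD:
        return False   # force miss: offload the fetch to the next level
    return predicted_hit

print(final_prediction(False, promoting_to_exclusive=True, pending_tasks=0))   # True
print(final_prediction(True, promoting_to_exclusive=False, pending_tasks=40))  # False
```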

FIG. 4 is a flow diagram of an example process 400 for selectively initiating downstream cache processing, according to embodiments. The process 400 may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processor to perform hardware simulation), firmware, or a combination thereof.

At block 402, an Lx (e.g., L2 cache) request may be received. For example, a lower cache level, such as L1, may have incurred a cache miss for a first cache line. Consequently, the L1 may transmit, and the L2 cache may receive, a fetch request for the same first cache line. Per block 404, it may be determined what the Lx predictor result is. The Lx prediction at block 404 may be made by a set predictor (e.g., the set predictor 307 of FIG. 3) that predicts or estimates whether there will be a cache hit or miss in the current level of cache being analyzed.

The Lx predictor may process predictions in any suitable manner. For example, in some embodiments, a pair of cache lines includes a steering bit table (SBT) and a rehash bit that are utilized to render the prediction. In these embodiments, when fetching a cache line entry, the effective address is used to index into the actual cache. A prediction index is used to select a particular steering bit. The steering bits are accessed prior to the cache access. Each entry “steers” references to the appropriate cache block. A rehash bit is utilized to avoid examining another line when that line cannot contain the requested address. A rehash bit reduces the number of probes, which allows misses to be started earlier or reduces the time the cache is busy. Various types of prediction sources may be utilized, such as effective addresses (as described above), register contents and offset (e.g., using contents and offset to form a prediction address), register number and offset (combining register number and offset several cycles before cache access), and/or instruction and previous references (using the address of the instruction issuing the reference and variants of the previous cache reference).
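
For illustration only, the following is a minimal sketch, under stated assumptions, of steering-bit prediction as outlined above: a steering bit table indexed by the effective address picks which block of a two-block pair to probe first, and a rehash bit suppresses the second probe when the other block cannot hold the address. The table size and index hash are illustrative assumptions, not the disclosed design.

```python
# A minimal sketch of steering-bit / rehash-bit probe prediction.

SBT_SIZE = 256
steering_bits = [0] * SBT_SIZE   # 0 or 1: which block of the pair to try first
rehash_bits = [0] * SBT_SIZE     # 1: the other block cannot contain the address

def prediction_index(effective_address):
    return (effective_address >> 7) % SBT_SIZE   # 128-byte lines, simple hash

def predicted_probes(effective_address):
    """Return the block numbers to probe, in predicted order."""
    idx = prediction_index(effective_address)
    first = steering_bits[idx]
    if rehash_bits[idx]:
        return [first]            # skip the second probe: start the miss earlier
    return [first, 1 - first]

print(predicted_probes(0x4080))   # both blocks probed, steered order
rehash_bits[prediction_index(0x4080)] = 1
print(predicted_probes(0x4080))   # one probe only; a miss can start earlier
```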

Per block 418, if it is predicted that there will be a cache miss, then the Lx request may be allowed to be transferred to arbitration (e.g., as processed by the request arbiter 316 of FIG. 3). Arbitration includes logic to determine whether to prepare a predicted miss (e.g., request_valid) or a buffered request (e.g., request_valid_late; block 410) for a next cache level request (e.g., L3_request_valid). Per block 420, the Lx cache sends an Lx+1 (e.g., another cache level) request to another level of cache (e.g., the L3_request_valid of FIG. 3). Generally, the processing time to predict the hit/miss result at block 404 and arbitrate at block 418 is relatively faster than obtaining the Lx directory lookup result at block 406. Accordingly, the Lx+1 request at block 420 may generally occur before the Lx directory lookup result at block 406. Therefore, regardless of when the Lx directory lookup result occurs at block 406, a speculative request is sent to the next level of cache for downstream processing, because of the high latency of the Lx directory lookup and because of the high probability that a request will need to be sent out anyway after the actual Lx directory lookup result at block 406 is completed.

Per block 406, if the Lx predictor result predicted a miss at block 404, then a request may be initiated to look up the actual hit/miss result in the Lx directory at block 406 (e.g., using the directory 128 of FIG. 1; the directory lookup 314 of FIG. 3). In some embodiments, the arbitration at block 418 and/or the sending of the Lx+1 request at block 420 is done in parallel or at substantially the same time as the initiation or beginning of the Lx directory lookup (not the result of the lookup). However, although these processes may be initiated in parallel, the latency to complete the lookup result at block 406 may be much longer than completing the arbitration at block 418. Accordingly, in some situations, a speculative Lx+1 request is sent at block 420 to another level of cache memory before the actual Lx directory lookup result at block 406. In some embodiments, the Lx directory lookup result at block 406 occurs via a table. The table may include memory addresses of cache lines currently stored in the Lx cache. Therefore, if the address of the cache line requested is found in the Lx directory, then there may be a “hit.” Otherwise, if the address is not in the directory, then there may be a “miss”.

Per block 408, if the actual directory lookup result at block 406 is a “hit,” then an Lx+1 cancel request (e.g., the request_cancel 305 of FIG. 3) may be transmitted to the next level of cache in order to cancel the Lx+1 request at block 420. In some situations, although it may be predicted that there will be a miss at block 404, the actual result may be a hit at block 406. Accordingly, because the hit at block 406 may be unexpected, a cancellation of the speculative Lx+1 request may be needed to prevent the furthering of downstream processing, as the data needed has been located at the current level of cache being analyzed. If the Lx directory lookup result at block 406 is a miss, the process 400 may stop.

Per block 410, if it is predicted at block 404 that there will be a cache hit, then an Lx+1 request may still be generated but buffered or temporarily paused (e.g., the delay 213 of FIG. 2). A predicted cache hit at block 404 may indicate that an Lx+1 request is likely to be canceled. This request to the next cache/memory level may be delayed so as to avoid resource allocation in the next level(s) of memory. Typical systems may transmit a “cancel” indication to the next level of cache even though an actual hit or miss has not yet been determined. However, if the prediction turned out to be wrong, and there was actually a miss, the cancel indication signal would have been futile and caused unnecessary latency, because an Lx+1 request would still need to be sent to the next level of cache to obtain the data. The Lx+1 request may be stored to a buffer at block 410. A buffer in the context of block 410 may be a temporary holding place for data to wait for other processes to occur first. For example, the buffer may be utilized to prevent the Lx+1 request from going to arbitration at block 418. The buffer may also be utilized to wait for the Lx directory lookup result at block 412.

Per block 412, while the Lx+1 request is buffered, the Lx directory lookup result is determined (e.g., by the directory 128 of FIG. 1). If the actual result is a hit at block 412, then per block 414, the Lx+1 request is cleared or emptied from the buffer. This effectively prevents the Lx+1 request from being transmitted to another cache memory. Therefore, the Lx+1 cache does not receive a request on its interface for a cache line, as the data is already located in cache Lx. Per block 416, the Lx request may be executed and completed such that data, for example, is returned to the calling processor from the Lx cache.
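
For illustration only, the following is a minimal end-to-end sketch of process 400: the prediction decides whether the Lx+1 request is sent speculatively or buffered, and the later directory result decides whether to cancel, clear, or release it. The returned strings are illustrative labels for the four outcomes, not disclosed terminology.

```python
# A minimal sketch of the four prediction/actual outcomes of process 400.

def process_400(predicted_hit, actual_hit):
    if not predicted_hit:                # blocks 418/420: speculative send
        if actual_hit:
            return "Lx+1 sent, then canceled (block 408)"
        return "Lx+1 sent, proceeds downstream"
    # block 410: predicted hit, so generate but buffer the Lx+1 request
    if actual_hit:
        return "buffered Lx+1 cleared (block 414); Lx completes (block 416)"
    return "buffered Lx+1 released to arbitration and sent"

for p in (False, True):
    for a in (False, True):
        print(f"predicted {'hit' if p else 'miss'}, actual {'hit' if a else 'miss'}: "
              f"{process_400(p, a)}")
```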

FIG. 5 is a flow diagram of an example process 500 for selectively initiating downstream cache processing, according to embodiments. The process 500 may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processor to perform hardware simulation), firmware, or a combination thereof.

At block 502, a first request may be received (e.g., from a processor and at a particular level of cache) to access (e.g., fetch) a first set of data in a first cache. For example, the request generation logic 201 of FIG. 2 may be included on a processor and may generate a request to be received by the first cache.

Per block 504, a cancel probability that a second request to a second cache for the first set of data will be canceled may be generated (e.g., by the cancel probability generator 207 of FIG. 2). Accordingly, a determination of a likelihood that the second request to the second cache for the first set of data will be canceled is made. For example, the second request to query the second cache may likely be canceled if there is already a cache hit at the first cache. Conversely, there may be no such likelihood of cancellation if there is a cache miss at the first cache. The cancel probability may be generated in any suitable manner. For example, a threshold integer value or a generated score estimate may indicate the likelihood. In some embodiments the score is based on various factors, such as a set predictor result (a prediction of whether there will likely be a cache hit or cache miss), the workload of a cache, whether a cache line is read-only or read/write, determinations made by the force logic 312 of FIG. 3, and/or the exclusive or shared status of the cache line. For example, even if there is a cache hit at the first cache, there may be a threshold quantity of task handling (e.g., because the cache has a relatively large storage capacity and is busy) currently on the first cache such that a request to the second cache is warranted to decrease fetch latency. In another example, there may be a cache hit at the first cache, but the first set of data may be exclusive to another processor such that the current calling processor has to fetch the first set of data from the second cache instead. In some embodiments, instead of or in addition to generating a “cancel” probability and likelihood as indicated in blocks 504 and 506, a “non-cancel” probability may be generated. The rest of the process 500 below block 504 helps to complete access to the first set of data based on the determining the likelihood that the second request to the second cache for the first set of data will be canceled.
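
For illustration only, the following is a minimal sketch, under stated assumptions, of combining the factors named above into a cancel-probability score: the predictor result, cache workload, and exclusivity. The weights and thresholds are illustrative assumptions, not disclosed values.

```python
# A minimal sketch of a multi-factor cancel-probability score (block 504).

def cancel_probability(predicted_hit, pending_tasks, exclusive_to_other,
                       busy_threshold=32):
    score = 0.9 if predicted_hit else 0.1   # dominant factor: set predictor
    if pending_tasks >= busy_threshold:
        score -= 0.4   # busy cache: prefer fetching from the next level anyway
    if exclusive_to_other:
        score -= 0.5   # line exclusive elsewhere: must fetch downstream
    return max(0.0, min(1.0, score))

# A hit prediction alone makes cancellation likely...
print(cancel_probability(True, pending_tasks=0, exclusive_to_other=False))   # 0.9
# ...but exclusivity to another processor pushes the score back down.
print(cancel_probability(True, pending_tasks=0, exclusive_to_other=True))    # 0.4
```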

Per block 506, based on the probability at block 504, it may be determined whether the second request to the second cache is likely to be canceled. If the second request is not likely to be canceled, this means that the second request will likely need to be transmitted to another cache/memory to retrieve the first set of data. Per block 508, if it is determined that the second request is not likely to be canceled, then the second request may be transmitted (e.g., by the request generation logic 201 of FIG. 2) to the second cache in order to access the first set of data. This may occur prior to a directory lookup indicating an actual cache hit or cache miss, and in response to the determination at block 506.

Per block 510, it may be determined (e.g., via the cancel probability generator 207) whether to continue processing the second request. For example, it may be determined whether there is an actual cache miss in the first cache and whether the first set of data is exclusive to another processor. Continuing with this example, if there is an actual cache miss at the current level of cache being analyzed and the first set of data is not exclusive, then per block 520, the second request at the second cache may continue to be executed, as the request was initiated at block 508. Continuing with this example, if there is an actual cache hit (inconsistent with the likelihood result at block 506), then per block 512 the second request may be canceled such that the first cache may transmit a third request to the second cache to cancel the second request. And because there is a cache hit at the first cache, per block 522, the first request at the first cache may be executed or completed.

Per block 514, if it is determined at block 506 that the second request to the second cache for the first set of data will likely be canceled, then the second request may be delayed (e.g., the delay 213 of FIG. 2). The delaying of the second request may occur to at least prevent the transmitting of the second request to the second cache (e.g., because it is likely that there is a cache hit at the first cache and the first set of data is not exclusive to another processor). The second request may also be delayed for other reasons, such as waiting for the actual cache hit/miss determination at block 516 and/or waiting for arbitration (e.g., the request arbiter 316 of FIG. 3).

Per block 516, it may be determined whether to proceed with the cancellation projected at block 506. For example, the block 516 determination may be based on whether there is an actual cache hit in the first cache and the read/write status of the first set of data. Per block 512, if the request should actually be canceled (e.g., if it is determined that there is an actual cache hit in the first cache and/or the read/write status doesn't match the second request), then the second request is canceled to the second cache such that the second request is not transmitted to the second cache. For example, the second request may be buffered as part of the delay at block 514. In response to the determining of an actual cache hit at block 516, the second request in the buffer may be deleted or cleared. At block 522, the first request may be executed at the first cache.

Per block 518, if it is determined that the second request should not be canceled, then the second request may be transmitted to the second cache in order to access the first set of data in the second cache. Accordingly, at block 520 the second request at the second cache may be executed. In some situations, the first set of data will also not be located in the second cache, but rather in a third or nth level of cache or memory. In these cases, a process similar to the process 500 may occur with respect to the third or nth level of cache.

FIG. 6 is a flow diagram of an example process 600 for selectively initiating downstream memory processing, according to embodiments. The process 600 may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processor to perform hardware simulation), firmware, or a combination thereof.

At block 602, a first request (e.g., a fetch request) to access a first set of data in a first memory (e.g., L1 cache) may be received. Per block 604 it may be predicted (e.g., by the set predictor 307 of FIG. 3) whether there will likely be a hit or a miss in the first memory. If it is predicted that there will not be a hit (e.g., there will be a miss), then blocks 606 and 608 may be performed in parallel or at substantially the same time.

Per block 606, actual hit/miss processing may be initiated (e.g., beginning a search for the first set of data in a directory). In some computing systems, completing actual hit/miss processing takes a relatively long quantity of time compared to block 608. For example, searching in a cache directory and locating/not locating the first set of data may take twice as long as block 608. Per block 608, a second request may be generated (e.g., by the request generation logic 201 of FIG. 2) and transmitted to a second memory (e.g., primary RAM) in order to access the first set of data from the second memory. Either of the operations at blocks 608 and 616 constitutes generating an action to facilitate access of the first set of data. The generating of the action occurs before completion of a determination of whether there is an actual hit or actual miss in the first memory.

Per block 610, the actual hit/miss processing may complete after it has been initiated at block 606 and after the transmission of the second request at block 608. For example, the completion may occur when the first set of data is located in a directory in the first cache, or when each entry in the directory was searched without locating the first set of data. Per block 612, it may be determined (e.g., via a directory) whether there was an actual miss (and/or hit). If there was an actual miss, then the process 600 may stop. Per block 614, if there was not an actual miss (e.g., there was a hit), the second request may be canceled since the first set of data was located in the first memory. Accordingly, any downstream processing by the second memory should be aborted.

Per block 616, if it was predicted at block 604 that there would be a hit, then a second request may be generated and buffered (e.g., in the request buffer 313 of FIG. 3). The second request is generated in order to access the first set of data from a second memory if for some reason the first set of data cannot or should not be accessed from the first memory.

Per block 618, it is determined (e.g., via a directory) whether there is an actual hit (and/or miss) in the first memory. If there is not a hit (e.g., there is a miss), then per block 620 the second request may be resumed and transferred from the buffer to the second memory in order to access the first set of data. In some embodiments, the process 600 may then repeat for other levels of memory in order to access the first set of data.

Per block 622, if there was an actual hit for the first set of data at the first memory, the second request is cleared from the buffer. The second request is cleared or deleted from the buffer because, with an actual hit in the first memory, there is no need to transmit the access request for the first set of data to the second memory. Per block 624, in response to the hit, the first set of data is accessed from the first memory to complete the first request.

FIG. 7 is a block diagram of an example computing system, according to embodiments. FIG. 7 illustrates an example computer 01 having a plurality of processors interconnected to a cache and memory through a network 20 (e.g., an SMP network). In embodiments an SMP network can operate to exchange data and/or logic signals (e.g., status indicators, protocol commands and/or responses, etc.) between processors, caches, and/or memories. In some embodiments, an SMP network can be aware of particular memory locations stored in cache lines of various caches and/or processors.

As shown in FIG. 7, computer 01 includes processor CHIP 10-1 and CHIP 10-2 (hereinafter, “chips 10”), L2 30, and MEMORY 40 interconnected by NETWORK 20. CHIP 10-1 and CHIP 10-2 include processors CORES 12-1-12-N (hereinafter, “cores 12”). In some embodiments, some or each of the cores 12 are considered to be cores similar to core 110 of FIG. 1, and can include a local (e.g., L1) cache and a pipeline (e.g., pipeline 114 of FIG. 1).

Likewise, in some embodiments, L2 30 is a cache similar to cache 120 of FIG. 1, and can include a request interface module 126 and a memory. Caches included in cores 12 and L2 30, and the memory, can be organized into cache lines. Further, while L2 30 and MEMORY 40 are shown in FIG. 7 as singular elements, it would be appreciated by one of ordinary skill in the art that, in embodiments, L2 30 and/or MEMORY 40 can comprise various numbers and/or types of memories, and/or arrangements of memories, such as caches included in memories, caches and/or memories connected hierarchically, and/or caches and/or memories connected in parallel with each other. Accordingly, as used herein, “L1” further refers to any form of cache integrated into or contained within a processor, and “L2” further refers to any next level cache (or, combination or arrangement of caches) connected between a local cache and another, higher level cache (e.g., an L3) and/or a main memory.

As previously described, a memory and/or a cache can be organized as cache lines of a particular size. For example, MEMORY 40 can be organized as cache lines, and the cache lines can be, for example, 128 bytes in size. In embodiments, a processor (e.g., core 12-3) can include a cache, such as a local cache, and store a copy of data stored in a cache line of a memory, in the L1 cache. For example, MEMORY 40 includes cache line 46, which further contains data at locations 42 and 44. In embodiments, location 42 and/or 44 can be a location, in MEMORY 40, of any unit of data ranging from a minimum size unit of data used by a processor (e.g., one byte) up to and including the amount of data comprising cache line 46 (e.g., 128 bytes).
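
For illustration only, the following is a minimal worked example of the 128-byte cache line organization described above: the line number and the offset within the line are derived from a byte address. The sample addresses are illustrative only.

```python
# A minimal sketch of splitting a byte address into (line, offset).

LINE_SIZE = 128   # bytes per cache line, as in the MEMORY 40 example

def split_address(address):
    line = address // LINE_SIZE    # which cache line holds the byte
    offset = address % LINE_SIZE   # where within that line it sits
    return line, offset

# Two byte locations 40 bytes apart land in the same 128-byte line:
print(split_address(0x1008))   # (32, 8)
print(split_address(0x1030))   # (32, 48)
```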

In the example of FIG. 7, NETWORK 20 comprises CONNECT 22 and CACHE REQ-RSP 24. In embodiments, CONNECT 22 can operate to interconnect cores 12 with L2 30 and/or MEMORY 40. CACHE REQ-RSP 24 represents a cache management element of COMPUTER 01. In embodiments, a cache management element can process cache line fetch requests and/or cache line fetch responses. Embodiments of a cache management element, such as CACHE REQ-RSP 24, can additionally have awareness of which processors and/or caches have copies of cache lines of a memory (e.g., line 46 of MEMORY 40), the status of such cache lines (e.g., shared or exclusive, or read-only or read/write), and/or whether (and which) processors have incurred an intervention associated with a cache line fetch.

The example of FIG. 7 illustrates cores 12 as connected to CONNECT 22 by means of interconnects 14, L2 30 by means of interconnect 16, and MEMORY 40 by means of interconnect 18. In embodiments, CONNECT 22 and/or interconnects 14, 16, and 18 can comprise a bus, point-to-point links, and/or a crossbar switch, or any combination or arrangement of these. For example, CONNECT 22 can be a crossbar or packet switch, and interconnects 14, 16, and 18 can be point-to-point links connecting to switch input and/or output connections of CONNECT 22. In alternative embodiments, CONNECT 22 can be a bus and interconnects 14, 16, and 18 can be bus connections to, and/or extensions of, a bus comprising CONNECT 22.

In other embodiments, CONNECT 22 and/or interconnects 14, 16, and 18 can comprise a combination of buses, links, and/or switches. For example, while not shown, it would be apparent to one of ordinary skill in the art that cores of a processor chip, such as cores 12-1-12-N, can interconnect amongst each other internal to CHIP 10-1, such as by means of buses, links, and/or switches, and that interconnect 14 can be a single connection between CHIP 10-1 and CONNECT 22. It would be further apparent to one of ordinary skill in the art that CONNECT 22, and the manner of connecting processor cores, chips, modules and/or caches and memories, can comprise a variety of types, combinations, and/or arrangements of interconnection mechanisms such as are known in the art, such as buses, links, and/or switches, and that these can be arranged as centralized, distributed, cascaded, and/or nested elements.

An SMP network, and/or a component thereof, can control and/or maintain status of cache lines amongst the plurality of caches. To illustrate, in the example of FIG. 7, in embodiments CACHE REQ-RSP 24 is representative of cache request/response functions within an SMP network that can be associated with processing cache line fetch requests, responses, and/or interventions among processors, caches, and/or memories interconnected by means of the SMP network. Such functions can include, for example, having awareness of the locations of cache lines among processors, caches, and/or memories, and/or having awareness of and/or participating in processing cache line fetches. In some embodiments, a processor can “snoop” the cache line requests of other processors and, in this way, can be aware of another processor having a copy of a missed cache line and, in some embodiments, can directly request a cache line fetch from another processor known to have a copy.

Embodiments can implement cache line request/response functions within a centralized unit, such as illustrated by CACHE REQ-RSP 24 in FIG. 7. In other embodiments, cache line request/response functions can be distributed amongst processors, caches, and/or memories. In embodiments, one or more cores and/or chips can perform some cache line request/response functions, and one or more caches can perform other cache line request/response functions. Using the example of FIG. 7, one or more of cores 12, and/or chips 10, and/or one or more caches (e.g., local caches of cores 12 and/or L2 30) can perform cache line request/response functions. Cores 12 can each maintain status of cache lines located within respective local caches, and L2 30 and/or CACHE REQ-RSP 24 can also maintain awareness and/or status of cache lines cached in the various local caches of cores 12. Cores 12 and/or L2 30 can maintain status of cache lines located within respective local caches and/or L2 30, while CACHE REQ-RSP 24 can receive and/or process interventions associated with cache line fetches directed to processors among cores 12.

As used herein, “SMP network” refers interchangeably to an SMP network as a whole (e.g., NETWORK 20) and to components of the SMP network (e.g., CACHE REQ-RSP 24), processors (e.g., chips 10 and/or cores 12), and/or caches (e.g., local caches of cores 12 and/or L2 30) used in performing functions associated with cache line requests and responses. Continuing the example of FIG. 7, NETWORK 20 can route communications between cores 12, L2 30, and/or MEMORY 40, such as by means of CONNECT 22. NETWORK 20 can receive cache line fetch requests from the cores, cache line fetch responses, and/or intervention notifications, and can route these among cores 12, L2 30, and/or MEMORY 40 (e.g., main memory). NETWORK 20 can have awareness of locations, within various caches, having copies of particular cache lines, and/or status of those cache lines, such as whether a particular cache line is shared amongst multiple processors and/or is subject to modification by a particular processor.

In embodiments, a processor can operate on data for one or multiple instructions using the cached copy of a memory cache line. For example, with reference to FIG. 7, CORE 12-1 can execute an instruction that uses data at location 42 in MEMORY 40 and can use the data at location 42 within a copy of cache line 46 in a local cache of CORE 12-1. In embodiments, if a processor incurs a cache miss for a cache line used in processing (e.g., executing) instructions, the processor can initiate a fetch of the cache line, and the fetch can obtain a copy of the cache line from another cache within the computing system, or from the memory. For example, with reference again to FIG. 7, if CORE 12-1 uses data in cache line 46 but does not already have a copy of cache line 46 in a local cache, CORE 12-1 can initiate a request to fetch cache line 46. In embodiments, initiating a fetch of a cache line can comprise a core communicating, to an SMP network, information about the cache line (e.g., a memory address and/or whether it is requested as a shared or an exclusive use or, alternatively, read-only or read/write). In alternative embodiments, initiating a fetch of a cache line can comprise a core communicating information about the cache line directly to another component of a system (e.g., another core, a cache, or a memory) known to have a valid copy of the cache line.

As previously described, in embodiments, under some circumstances (e.g., when a cache line has shared status), multiple processors in a computing system can cache a copy of a cache line in a respective local cache of the processors. In processing a cache line fetch request, the request can be satisfied by providing a copy of the cache line from one of the processors having a copy. For example, CORE 12-1 can request a copy of cache line 46 and, if a local cache of another core among cores 12 has a valid copy of the cache line, a copy of cache line 46 can be transferred from the local cache of that core to CORE 12-1 to satisfy the fetch request. However, if another core does not have a valid copy of cache line 46, but L2 30 has a valid copy, a copy of cache line 46 can be transferred from L2 30 to CORE 12-1 to satisfy the fetch request. If no caches in the computing system have a valid copy of cache line 46, a copy of cache line 46 can be transferred from MEMORY 40 to CORE 12-1 to satisfy the fetch request.
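
For illustration only, the following is a minimal sketch of the source-selection order described above: satisfy a fetch from a peer core's local cache first, then from L2, and only then from main memory. The data structures are illustrative assumptions.

```python
# A minimal sketch of choosing which element supplies a requested cache line.

def satisfy_fetch(line, peer_caches, l2_cache, memory):
    for core, cache in peer_caches.items():
        if line in cache:
            return f"copy of line {line} from local cache of {core}"
    if line in l2_cache:
        return f"copy of line {line} from L2"
    return f"copy of line {line} from memory ({memory[line]})"

peers = {"CORE 12-2": {46}, "CORE 12-3": set()}
print(satisfy_fetch(46, peers, l2_cache={46}, memory={46: "data"}))  # peer cache
print(satisfy_fetch(47, peers, l2_cache={46}, memory={47: "data"}))  # main memory
```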

From the example of FIG. 7, it can be seen that transferring cache lines between processors, caches, and/or memories has an associated utilization of those elements and/or of the elements interconnecting them (e.g., an SMP network). Transferring cache lines can have an associated “overhead” in terms of, for example, numbers of instruction cycles associated with latency to complete a cache line transfer, data transfer bandwidth or throughput, and/or computing bandwidth or throughput. In embodiments, overhead can include increased utilization of data buses, inter-processor links, and/or inter-memory links to transfer the cache line; increased instruction execution latency (awaiting completion of the transfer) for a requesting processor to complete execution of one or more instructions that use the cache line; and increased processor and/or cache utilization in processors to manage and perform the transfer.

Transfer latency (the time required to receive a cache line following a fetch request) can increase based on which element (e.g., a particular cache or a memory) provides a copy of a cache line to satisfy a fetch request. For example, transferring a cache line from a core within a different chip, or from another cache not local to a processor, can have a much higher latency in comparison to transferring a cache line from a core within the same chip, or from a cache closer (having fewer interconnections) to a requesting processor. High transfer latency can cause a processor to wait longer to perform an operation, or to complete an instruction, that uses data within that cache line, and in turn this can reduce processor performance. For example, fetching data not included in a local cache of a processor can correspond to many hundreds or thousands of processor execution cycles. Accordingly, it can be advantageous to processor and/or overall computing system performance to reduce cache line fetches associated with multiple processors using a cache line.

It is to be understood that although the computer 01 of FIG. 7 is illustrated as having a particular quantity of chips 10, cores 12, and other components, this quantity is representative only and accordingly there may be more or fewer components than illustrated. In some embodiments, some or each of the processes described in FIGS. 4, 5, and 6 may be implemented by the computer 01.

FIG. 8 is a block diagram of a computing system 800, according to embodiments. As shown in FIG. 8, computer system 800 includes computer 53 having processors 13-1 and 13-2. In embodiments, the computer 53 can be or include the components described in the computer 01 of FIG. 7, and vice versa. Likewise, processors 13-1 and/or 13-2 can comprise processors such as previously described (e.g., CORE 110 of FIG. 1), a general purpose or a special purpose processor, a co-processor, or any of a variety of processing devices that can execute computing instructions.

FIG. 8 illustrates computer system 800 configured with interface 05 coupling computer 53 to input source 03. In embodiments, interface 05 can enable computer 53 to receive, or otherwise access, input data via, for example, a network (e.g., an intranet, or a public network such as the Internet), or a storage medium, such as a disk drive internal or connected to computer 53. For example, input source 03 can be an SMP network (e.g., NETWORK 20 in FIG. 7) or another processor, such as a core among cores 12 illustrated in FIG. 7, and input source 03 can provide requests to fetch a cache line or a data object to computer 53, or can otherwise enable computer 53 to receive a request to fetch a cache line or data object, or to receive a cache line or a data object, using interface 05.

Interface 05 can be configured to enable human input, or to couple computer 53 to other input devices, such as described later in regard to components of computer 53. It would be apparent to one of ordinary skill in the art that the interface can be any of a variety of interface types or mechanisms suitable for a computer, or a program operating in a computer, to receive or otherwise access input data.

Processors included in computer 53 are connected by a memory interface 15 to memory 17. In embodiments, a “memory” can be a cache memory, a main memory, a flash memory, or a combination of these or other varieties of electronic devices capable of storing information and, optionally, making the information, or locations storing the information within the memory, accessible to a processor. A memory can be formed of a single electronic (or, in some embodiments, other technologies such as optical) module or can be formed of a plurality of memory modules. A memory, or a memory module (e.g., an electronic packaging of a portion of a memory), can be, for example, one or more silicon dies or chips, or can be a multi-chip module package. Embodiments can organize a memory as a sequence of bytes, words (e.g., a plurality of contiguous or consecutive bytes), or pages (e.g., a plurality of contiguous or consecutive bytes or words).
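
By way of illustration only, the following minimal sketch decomposes a byte address into page, word, and byte indices under one such organization; the 4 KiB page size and 8-byte word size are assumptions chosen for demonstration, not values from this disclosure.

    # Illustrative only: a 4 KiB page and an 8-byte word are assumed sizes.
    PAGE_SIZE = 4096   # bytes per page
    WORD_SIZE = 8      # bytes per word

    def decompose(address):
        """Split a byte address into (page index, word index within the page, byte within the word)."""
        page = address // PAGE_SIZE
        word = (address % PAGE_SIZE) // WORD_SIZE
        byte = address % WORD_SIZE
        return page, word, byte

    print(decompose(0x1234))   # (1, 70, 4)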

In embodiments, the computer 53 can include a plurality of memories. A memory interface, such as 15, between a processor (or, processors) and a memory (or, memories) can be, for example, a memory bus common to one or more processors and one or more memories. In some embodiments, a memory interface, such as 15, between a processor and a memory can be a point-to-point connection between the processor and the memory, and each processor in the computer can have a point-to-point connection to each of one or more of the memories. In other embodiments, a processor (for example, 13-1) can be connected to a memory (e.g., memory 17) by means of a connection (not shown) to another processor (e.g., 13-2) that is itself connected to the memory (e.g., from processor 13-2 to memory 17).

The computer 53 includes an IO bridge 25, which can be connected to a memory interface or, alternatively (not shown), to a processor. In some embodiments, an IO bridge can be a component of a processor or a memory. An IO bridge can interface the processors and/or memories of the computer (or, other devices) to IO devices connected to the bridge. For example, computer 53 includes IO bridge 25 interfacing memory interface 15 to IO devices, such as IO device 27. In some embodiments, an IO bridge can connect directly to a processor or a memory. An IO bridge can be, for example, a PCI-Express or other IO bus bridge, or can be an IO adapter.

An IO bridge can connect to IO devices by means of an IO interface, or IO bus, such as IO bus 31 of computer 53. For example, IO bus 31 can be a PCI-Express or other IO bus. IO devices can be any of a variety of peripheral IO devices or IO adapters connecting to peripheral IO devices. For example, IO device 29 can be a graphics card, a keyboard or other input device, a hard drive or other storage device, a network interface card, etc. IO device 29 can be an IO adapter, such as a PCI-Express adapter, that connects components (e.g., processors or memories) of a computer to IO devices (e.g., disk drives, Ethernet networks, video displays, keyboards, mice, etc.).

A computer can include instructions executable by one or more of the processors (or, processing elements, such as threads of a processor). The instructions can be a component of one or more programs. The programs, or the instructions, can be stored in, and/or utilize, one or more memories of a computer. As illustrated in the example of FIG. 8, computer 53 includes a plurality of programs, such as programs 09-1, 09-2, and 11-1. A program can be, for example, an application program, an operating system or a function of an operating system, or a utility or built-in function of a computer. A program can be a hypervisor, and the hypervisor can, for example, manage sharing resources of the computer (e.g., a processor or regions of a memory, or access to an IO device) among a plurality of programs or operating systems. A program can be a program that embodies the methods, or portions thereof, of the disclosure. For example, a program can be a program that executes on a processor of computer 53 to perform one or more methods similar to example processes 400, 500, or 600 in FIGS. 4, 5, and/or 6. A program can perform methods similar to these, modified, as would be understood by one of ordinary skill in the art, suitably for applications sharing data objects in a system such as illustrated in FIG. 1, FIG. 2, and/or FIG. 3.

Programs can be “stand-alone” programs that execute on processors and use memory within the computer directly, without requiring another program to control their execution or their use of resources of the computer. For example, computer 53 includes stand-alone program 11-2. A stand-alone program can perform particular functions within the computer, such as controlling, or providing an interface to (e.g., access by other programs), an IO interface or IO device. A stand-alone program can, for example, manage the operation of, or access to, a memory. A Basic Input/Output System (BIOS), or a computer boot program (e.g., a program that can load and initiate execution of other programs), can be a stand-alone program.

A computer can include one or more operating systems, and an operating system can control the execution of other programs such as, for example, to start or stop a program, or to manage resources of the computer used by a program. For example, computer 53 includes operating systems (OS) 07-1 and 07-2, each of which can include, or manage execution of, one or more programs, such as OS 07-2 including (or, managing) program 11-1. In some embodiments, an operating system can function as a hypervisor.

A program can be embodied as firmware (e.g., BIOS in a desktop computer, or a hypervisor) and the firmware can execute on one or more processors and, optionally, can use memory included in the computer. Firmware can be stored in a memory (e.g., a flash memory) of the computer. For example, computer 53 includes firmware 19 stored in memory 17. In other embodiments, firmware can be embodied as instructions (e.g., comprising a computer program product) on a storage medium (e.g., a CD ROM, a flash memory, or a disk drive), and the computer can access the instructions from the storage medium.

The example computer system 800 and computer 53 are not intended to be limiting to embodiments. In embodiments, computer system 800 can include a plurality of processors, interfaces, and inputs, and can include other elements or components, such as networks, network routers or gateways, storage systems, server computers, virtual computers or virtual computing and/or IO devices, cloud-computing environments, and so forth. It would be evident to one of ordinary skill in the art to include a variety of computing devices interconnected in a variety of manners in a computer system embodying aspects and features of the disclosure.

In embodiments, computer 53 can be, for example, a computing device having a processor capable of executing computing instructions and, optionally, a memory in communication with the processor. For example, computer 53 can be a desktop or laptop computer; a tablet computer, mobile computing device, or cellular phone; or a server computer, a high-performance computer, or a supercomputer. Computer 53 can be, for example, a computing device incorporated into a wearable apparatus (e.g., an article of clothing, a wristwatch, or eyeglasses), an appliance (e.g., a refrigerator, or a lighting control), a vehicle and/or traffic monitoring device, a mechanical device, or (for example) a motorized vehicle. It would be apparent to one of ordinary skill in the art that a computer embodying aspects and features of the disclosure can be any of a variety of computing devices having processors and, optionally, memories and/or programs.

The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems and/or methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

The descriptions of the various embodiments of the present disclosure have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Described below are particular definitions specific to the present disclosure:

“And/or” is the inclusive disjunction, also known as the logical disjunction and commonly known as the “inclusive or.” For example, the phrase “A, B, and/or C” means that at least one of A or B or C is true; and “A, B, and/or C” is only false if each of A and B and C is false.

A “set of” items means there exists one or more items; there must exist at least one item, but there can also be two, three, or more items. A “subset of” items means there exists one or more items within a grouping of items that contain a common characteristic.

The terms “receive,” “provide,” “send,” “input,” “output,” and “report” should not be taken to indicate or imply, unless otherwise explicitly specified: (i) any particular degree of directness with respect to the relationship between an object and a subject; and/or (ii) a presence or absence of a set of intermediate components, intermediate actions, and/or things interposed between an object and a subject.

A “module” is any set of hardware, firmware, and/or software that operatively works to do a function, without regard to whether the module is: (i) in a single local proximity; (ii) distributed over a wide area; (iii) in a single proximity within a larger piece of software code; (iv) located within a single piece of software code; (v) located in a single storage device, memory, or medium; (vi) mechanically connected; (vii) electrically connected; and/or (viii) connected in data communication. A “sub-module” is a “module” within a “module.”

The terms first (e.g., first cache), second (e.g., second cache), etc., are not to be construed as denoting or implying order or time sequences. Rather, they are to be construed as distinguishing two or more elements. In some embodiments, the two or more elements, although distinguishable, have the same makeup. For example, a first memory and a second memory may indeed be two separate memories, but they both may be RAM devices that have the same storage capacity (e.g., 4 GB). Moreover, a “first cache” and a “second cache,” etc., are not to be construed as particular levels of cache (e.g., L1 and L2), but are to be construed as different caches in general.

As used herein, “processor” refers to any form and/or arrangement of a computing device using, or capable of using, data stored in a cache, including, for example, pipelined and/or multi-cycle processors, graphical processing units (GPUs), and/or neural networks. Also, as used herein, “computing system” refers to a computing system that employs processors utilizing data stored in one or more caches. However, this is not intended to limit embodiments, and it would be appreciated by one of ordinary skill in the art that embodiments can employ other varieties and/or architectures of processors within the scope of the disclosure.

What is claimed is:
 1. A computer-implemented method comprising:
detecting that a data fetch request to a level 1 (L1) cache line resulted in an L1 cache miss;
generating, by logic in the L1 cache and in response to detecting the L1 cache miss, a first request to access a first set of data in a level two (L2) cache;
transmitting the first request to the L2 cache;
predicting, by logic in the L2 cache, a location of the first set of data in the L2 cache;
predicting that there will be a cache miss in the L2 cache for the first request;
generating, in response to predicting that there will be a cache miss in the L2 cache for the first request, a second request to access the first set of data in a level 3 (L3) cache;
generating a cancel probability score based on at least one of the workload of a cache, whether a cache line is read-only or write-only, and the exclusive or shared status of a processor;
determining that the second request to the L3 cache for the first set of data will likely not be canceled based on the cancel probability score;
transmitting, prior to a directory lookup indicating an actual L2 cache hit or actual L2 cache miss and in response to determining that the second request to the L3 cache for the first set of data will likely not be canceled, the second request to the L3 cache;
performing a directory lookup for the first set of data in the L2 cache;
determining, via the directory lookup, that there is an actual cache hit in the L2 cache for the first request;
transmitting, in response to the determining that there is the actual cache hit in the L2 cache for the first request, a third request to the L3 cache to cancel the second request;
clearing, by the L3 cache and in response to receiving the third request, the second request from a request buffer for the L3 cache;
retrieving the first set of data from the L2 cache;
receiving a fourth request for a second set of data in the L2 cache;
predicting, by logic in the L2 cache, a location of the second set of data in the L2 cache;
predicting that there will be a cache hit in the L2 cache for the fourth request;
transmitting, in response to predicting that there will be a cache hit in the L2 cache for the fourth request, the fourth request to a request buffer for the L2 cache;
determining, via a second directory lookup, that there is an actual L2 cache miss for the second set of data;
sending, in response to determining that a directory indicates the actual L2 cache miss for the second set of data, the fourth request from the request buffer to a request arbiter for the L2 cache;
transmitting the fourth request from the request arbiter to the L3 cache; and
retrieving the second set of data from the L3 cache.
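
By way of illustration only, the following minimal sketch models the request flow recited above in software. The class names, data structures, and the 0.5 cancel-probability threshold are assumptions made for demonstration; embodiments implement this logic in cache-controller hardware rather than in a program.

    # Illustrative only: names, data structures, and the 0.5 threshold are assumed.
    class SimpleCache:
        """Stand-in for a downstream (e.g., L3) cache with a request buffer."""
        def __init__(self, contents):
            self.contents = contents                  # {address: data}
            self.request_buffer = []

        def enqueue(self, address):
            self.request_buffer.append(address)

        def cancel(self, address):
            if address in self.request_buffer:
                self.request_buffer.remove(address)   # clear the canceled request

        def serve(self, address):
            self.cancel(address)
            return self.contents.get(address)

    class L2Controller:
        """Predicts hit or miss, speculatively forwards a downstream request, and
        cancels it if the directory lookup later reports an actual L2 hit."""
        def __init__(self, l2_contents, predicted_hits, l3):
            self.l2_contents = l2_contents            # data actually resident in L2
            self.predicted_hits = predicted_hits      # set predictor's guesses
            self.l3 = l3                              # downstream cache

        def cancel_probability(self, address):
            # Placeholder score; embodiments may weigh cache workload, whether the
            # line is read-only or write-only, and exclusive/shared status.
            return 0.0

        def handle_request(self, address):
            speculative_sent = False
            if address not in self.predicted_hits:              # predicted L2 miss
                if self.cancel_probability(address) < 0.5:      # unlikely to be canceled
                    self.l3.enqueue(address)                    # forward before the lookup
                    speculative_sent = True

            actual_hit = address in self.l2_contents            # directory lookup
            if actual_hit:
                if speculative_sent:
                    self.l3.cancel(address)                     # cancel the downstream request
                return self.l2_contents[address]                # serve from L2
            if not speculative_sent:
                self.l3.enqueue(address)                        # forward after the lookup
            return self.l3.serve(address)                       # serve from L3

    # Predicted miss / actual hit (speculative request canceled), then
    # predicted hit / actual miss (forwarded only after the lookup).
    l3 = SimpleCache({0x40: "B"})
    l2 = L2Controller({0x10: "A"}, predicted_hits={0x40}, l3=l3)
    print(l2.handle_request(0x10))   # "A"
    print(l2.handle_request(0x40))   # "B"

In this sketch, a predicted L2 miss that turns out to be an actual hit causes the speculative downstream request to be canceled from the downstream request buffer, while a predicted hit that turns out to be an actual miss is forwarded downstream only after the directory lookup completes.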